ABSTRACT

The Disease Ontology (DO) database (http://disease-ontology.org)
represents a comprehensive
knowledge base of 8043 inherited, developmental 
and acquired human diseases (DO version 3, 
revision 2510). The DO web browser has been de- 
signed for speed, efficiency and robustness through 
the use of a graph database. Full-text contextual 
searching functionality using Lucene allows the 
querying of name, synonym, definition, DOID and 
cross-reference (xrefs) with complex Boolean 
search strings. The DO semantically integrates 
disease and medical vocabularies through extensive 
cross mapping and integration of MeSH, ICD, NCI's 
thesaurus, SNOMED CT and OMIM disease-specific 
terms and identifiers. The DO is utilized for disease 
annotation by major biomedical databases (e.g. 
ArrayExpress, NIF, IEDB), as a standard representation
of human disease in biomedical ontologies
(e.g. IDO, Cell line ontology, NIFSTD ontology, 
Experimental Factor Ontology, Influenza Ontology), 
and as an ontological cross mappings resource 
between DO, MeSH and OMIM (e.g. GeneWiki). The 
DO project (http://diseaseontology.sf.net) has been 
incorporated into open source tools (e.g. GeneAnswers,
FunDO) to connect gene and disease biomedical
data through the lens of human disease.
The next iteration of the DO web browser will inte- 
grate DO's extended relations and logical definition 
representation along with these biomedical 
resource cross-mappings. 

INTRODUCTION 

From ancient texts such as the Eshuma Code of Babylon 
in the 23rd century BC (1) to the experimental results
reported in literature today, scientists have documented
variation in human health in order to unravel
the mystery of disease. Diagnostic evaluation, treatment 
and data comparisons over time and between studies can 
be greatly facilitated by semantically consistent annota- 
tions such as those available through the Disease 
Ontology (DO). 

The research and clinical communities have developed 
and utilized a variety of vocabularies in order to system- 
atically record mortality and morbidity classifications, to 
standardize clinical and event healthcare reporting, to 
index Medline articles or to interconnect biomedical con- 
cepts defined across hundreds of disparately developed 
vocabularies, coding systems, thesauri and classifications. 
Although these vocabularies and ontologies include 
disease and disease related concepts and terms, none of 
them are 'organized' around the concept of disease.

The DO was developed to create a single structure for 
the classification of disease which unifies the representa- 
tion of disease among the many and varied terminologies 
and vocabularies into a relational ontology that permits 
inference and reasoning of the relationships between dis- 
ease terms and concepts and is optimized toward 
annotating disease. 

The DO aims to provide a clear definition for each dis- 
ease within an etiological based classification of disease 
enabling their consistent use and application for annotat- 
ing biomedical data. The DO addresses the complexity of 
disease nomenclature through the inclusion of MeSH, 
OMIM, ICD and SNOMED CT concept names and 
IDs. The DO web browser will provide a framework for 
data mining, reasoning and inference enabling the explor- 
ation of biomedical disease and gene data for ongoing 
research and novel discovery based on the shared repre- 
sentation of disease. In this report, we present the new DO 
database and web browser (http://disease-ontology.org) 
(Figure 1), a description of the DO's semantic integration 
activities, data updates and the DO's development 
directions. 
Figure 1. DO web interface with search, navigation and display functions. The disease tree view displays the DO's hierarchical structure
placement of top level parent nodes expandable to view subgraphs. The fungal infectious disease subgraph with its direct child terms are 
Term Metadata is displayed for selected terms from the tree view. 



SCOPE OF THE DO 

The DO is an open source ontological description of 
human disease, organized from a clinical perspective of 
disease etiology and location. Providing the classification 
framework for a disease 'Rosetta Stone' was a driving use 
case for starting the Disease Ontology in 2004 (2,3). 

The initial builds of DO in 2003 and 2004 used ICD-9 as 
the foundational vocabulary. These early versions were 
extensively reorganized by process, system affected and 
cause (genetic disorders, infectious diseases, metabolic 
disorders). Further revisions reorganized DO around
UMLS disease concepts, in conjunction with term concept
mappings to SNOMED CT and ICD-9.

The DO has become a community-driven, open and 
extensible framework for capturing human disease know- 
ledge through direct and indirect semantic relationships. 
The DO enables the exploration of datasets and data re- 
sources through disease mappings available in clinical, 
gene and genome study metadata. This exploration leverages
the semantic richness embedded in the DO. DO's
directed acyclic graph (DAG) presents terms linked by
computable relationships in a hierarchy (e.g. brain glio- 
blastoma multiforme is_a brain glioma, and brain glioma 
is_a brain cancer) organized by interrelated subtypes (e.g. 
Brill-Zinsser disease is_a epidemic typhus, and epidemic 
typhus is_a typhus). The DO is organized into eight main 
nodes to represent cellular proliferation, mental health, 
anatomical entity (e.g. cardiovascular system disease), in- 
fectious agent (e.g. anthrax), metabolism and genetic 
diseases along with medical disorders and syndromes
anchored by traceable, stable identifiers (DOIDs). 
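
The is_a hierarchy described above can be sketched as a small lookup structure. The following is a minimal illustration only, not the DO implementation: the term names come from the examples in the text, while the dictionary layout and the function name `ancestors` are assumptions made for this sketch.

```python
from collections import deque

# Toy child -> parents map built from the is_a examples quoted in the text.
# A DAG allows more than one parent, so each value is a list.
IS_A = {
    "brain glioblastoma multiforme": ["brain glioma"],
    "brain glioma": ["brain cancer"],
    "Brill-Zinsser disease": ["epidemic typhus"],
    "epidemic typhus": ["typhus"],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable by following is_a edges upward, nearest first."""
    found = []
    queue = deque(is_a.get(term, []))
    while queue:
        parent = queue.popleft()
        if parent not in found:
            found.append(parent)
            queue.extend(is_a.get(parent, []))
    return found


print(ancestors("brain glioblastoma multiforme"))
print("typhus" in ancestors("Brill-Zinsser disease"))
```

Because is_a is transitive, an annotation made with a specific term can also be retrieved by a query on any of its ancestors.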

The DO project continues to improve and expand the 
representation of all human disease with the addition of 
new DO terms as needed for curation, term requests and 
collaborative development. Rare diseases, for example, are 
currently underrepresented in DO. Curatorial efforts are 
underway to deepen DO's representation and to expand 
our standard is_a relations in the DO logical definition 
(HumanDO_xp.obo) file. The additional logical definition
file format connects disease terms with related ontological 
concepts (e.g. anatomy, phenotype, disorder, cell type). 
The HumanDO_xp.obo file is available from DO's
SourceForge site and includes additional relationships 
for 931 DO terms. 

The DO provides ongoing documentation via the DO
wiki (http://diseaseontology.sf.net), DO Facebook
(http://www.facebook.com/group.php?gid=130516806961828),
DO LinkedIn (http://www.linkedin.com/groups?gid=3078180&trk=anetsrch_name),
DO Twitter postings (http://twitter.com/#!/diseaseontology)
and the DO website (http://disease-ontology.org/about).

Ontological disease definition 

An ontological definition of disease enables each type (or 
class) of disease to be singularly classified in a formalized 
structure. The ontological distinctions between disorder, disposition
and disease as a realized disposition have been
clarified by the development of the upper level
organizational Basic Formal Ontology (BFO,
http://www.ifomis.uni-saarland.de/bfo/) and the Ontology of General
Medical Science (OGMS,
http://www.acsu.buffalo.edu/~ag33/ogms.html) along with discussion of ontological
realism for mental disease (4) and the treatment of 
disease and diagnosis (5). Encompassing clinical descrip- 
tors of disease, the Disease Ontology has clarified DO's 
ontological scope with the adoption of the OGMS onto- 
logical definition of disease, 'A disposition (i) to undergo 
pathological processes that (ii) exists in an organism 
because of one or more disorders in that organism'. 
Within this context, DO describes the attributes of 
disease as manifested in individuals. 



DO SEMANTIC INTEGRATION 

The immune system, bone, mental, genetic and
infectious disease subtrees in DO have been broadened
through collaborative efforts, with the DO team improving
DO to meet the needs of our community. The DO project
has provided the ontological framework for uniform data 
management and consistent annotation of human disease 
terms in biomedical databases and ontologies. 

DO terms and their DOIDs have been utilized to 
annotate disease concepts in several major biomedical re- 
sources. The Rat Genome Database (6) (RGD) annotates 
its rat and mouse gene records and rat QTLs that are
animal models of human disease with DO's human disease 
terms. The Immune Epitope Database (IEDB) (7) epitope 
records are annotated with 168 DO terms. Annotation of 
the GeneWiki's (8) gene records with 2983 candidate DO 
annotations is underway. Experimental expression records 
(9611) at the EBI's ArrayExpress (9) have been annotated
with DO terms representing an extensive resource for 
understanding the relationships between diseases and 
gene function. 

DO continues to be utilized by a growing set of biomed- 
ical ontologies as a standard representation of disease. For 
example, the NCBO's Neuroscience Information Framework
(NIF) Standard ontology [NIFSTD, (10)] has integrated 
DO's representation of 252 mental disorders and neuro- 
logical diseases. Feedback provided by NIF subject matter 
experts continues to improve DO's disease representation. 

DO CONTENT AND STRUCTURE 

The DO is logically structured into major types of disease 
to enable guided expansion of the ontology. The DO is 
being enhanced through the continued efforts to improve 
our representation of textual definitions (1822 textual def- 
initions, 22% of DO terms, DO version 3, revision 2510). 
The DO's stable HumanDO.obo file provides the basis to 
advance DO's representation of the complex relationships 
between disease, disorder and phenotype. DO has begun 
to expand our set of cross-product relations linking DO 
terms to orthogonal ontologies with the annotation of 
disease attributes (e.g. symptom, phenotype, anatomical 
or cellular location and pathogenic agent) with 932 
logical definitions in the DO's logical definition file 
(HumanDO_xp.obo) to the Foundational Model of
Anatomy (FMA) (11), Human Phenotype Ontology
(HP) (12), NCBI organismal classification vocabulary (13),
Transmission Process ontology, Symptom Ontology
(14), PATO (15), GO (16) and Cell Type ontology (17).
Expansion of DO's set of relations in the HumanDO_xp.obo
file (transmitted_by, results_in_formation_of,
results_in, realized_by_suppression_with, part_of,
located_in, has_symptom, has_material_basis_in, derives_from
and composed_of) (18) will expand the DO's ability to
define these complex relationships.
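
To make the idea of typed relations to orthogonal ontologies concrete, the sketch below reads relationship tags from an OBO-style term stanza. It is an illustration under stated assumptions: the stanza is invented, all identifiers in it are placeholders rather than real DOID, FMA or Symptom Ontology IDs, and the actual layout of HumanDO_xp.obo may differ (for instance, it may express logical definitions with intersection_of tags). The function name `parse_stanza` is hypothetical.

```python
# Invented OBO-style stanza; every identifier below is a placeholder.
STANZA = """\
[Term]
id: DOID:0000000
name: example disease
is_a: DOID:0000001 ! example parent disease
relationship: located_in FMA:0000000 ! example anatomical structure
relationship: has_symptom SYMP:0000000 ! example symptom
"""


def parse_stanza(text):
    """Collect id, name, is_a parents and (relation, target) pairs from one stanza."""
    term = {"is_a": [], "relationship": []}
    for line in text.splitlines():
        line = line.split("!", 1)[0].strip()  # drop trailing '! comment' text
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "is_a":
            term["is_a"].append(value)
        elif tag == "relationship":
            relation, target = value.split(None, 1)
            term["relationship"].append((relation, target))
        else:
            term[tag] = value
    return term


parsed = parse_stanza(STANZA)
print(parsed["relationship"])
```

Keeping the relation name separate from its target makes it straightforward to filter a term's links by type, for example to list only its located_in targets.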

Linking disease terminologies 

DO's extensive cross-mapping and inclusion of concepts 
from the standard clinical and medical terminologies 
[MeSH (19), ICD (20), OMIM (21) and NCI thesaurus 
(22)] into an ontological classification of disease (23) 
provides a rich resource for semantically connecting 
phenotypic, gene and genetic information related to 
human disease. Linking health information and patients'
electronic health records will be further enhanced through
the planned harmonization of ICD and SNOMED CT
terminologies and classification
(http://www.who.int/classifications/AnnouncementLetter.pdf).

DO identifies, integrates and connects synonymous
disease concepts in MeSH, SNOMED CT, OMIM,
ICD9CM and DO based on each disease term's UMLS
Concept Unique Identifiers (CUIs). DO updates vocabulary
mappings twice yearly from an extraction of term
CUIs from the UMLS MRCONSO.RRF vocabulary
mapping file (Table 1). Through this process, 91%
(7845) of DO terms (August, 2011) are mapped to 
UMLS CUIs. This represents a 7% reduction of UMLS 
mappings since the May 2010 DO-UMLS mapping reflect- 
ing DO's increased utilization of logical definitions to 
define complex disease relationships which has decreased 
the number of unique DOIDs. For example, the DO 
defines adenocarcinoma as a type of (is_a relationship) 
carcinoma that is derived from epithelial cells which ori- 
ginate in glandular tissue. The DO defines gallbladder 
adenocarcinoma as a type of gallbladder carcinoma. 
These two sets of relationships represent a single 



Table 1. DO UMLS CUI ID mappings

                 Vocabulary IDs             DO IDs
Vocabulary       May 2010   August 2011     May 2010   August 2011
OMIM                 2304          1389         1594          2330
SNOMED CT           20985         14313         8054          5155
NCI thesaurus        7249          4761         7067          4858
MeSH                 3932          3032         3921          3116
ICD9CM               6403          2971         5757          3325

A total of 7845 of 8588 DOIDs (91%) in DO version 3, revision 2490
were mapped to the 2011 UMLS CUIs in August 2011. The number of 
unique vocabulary IDs mapped is given in the center column and the 
number of DO terms mapping to other terminologies through the CUI 
mapping file is presented on the right. Note that a single DO term may 
have multiple matches in a given terminology. The decrease in 
SNOMED CT mappings is a reflection of the increased use of logical 
definitions in DO. 
parentage for each term as defined in the HumanDO.obo
file and visualized in the DO web browser. Multiple par- 
entage (multiple is_a relationships) inherited from the 
UMLS vocabularies have been greatly reduced in the cur- 
rent version of the DO. Curatorial efforts are ongoing to 
represent secondary parentage with the creation of 
cross-reference definitions (logical definitions) in the 
HumanDO_xp.obo logical definition file. Logical defin- 
itions provide the opportunity to define the relationship 
between a type of organ cancer (e.g. gallbladder adeno- 
carcinoma) and tumor's cell type (e.g. adenoma) as a type 
of adenoma or to define the anatomical location of a 
disease (gallbladder adenocarcinoma is located_in the
gallbladder).
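
The CUI-based cross-mapping described above can be sketched as a join on the shared concept identifier. This is an illustration only and not the DO project's pipeline: the rows are invented, the CUIs and codes are placeholders, and the column positions (CUI in field 0, source vocabulary SAB in field 11, source code CODE in field 13) reflect the pipe-delimited MRCONSO.RRF layout as assumed here and should be checked against the UMLS release in use.

```python
from collections import defaultdict

# Assumed MRCONSO.RRF field positions (verify against the UMLS documentation).
CUI, SAB, CODE = 0, 11, 13

# Invented rows in MRCONSO.RRF style; CUIs and codes are placeholders.
SAMPLE_ROWS = [
    "C0000001|ENG|P|L1|PF|S1|Y|A1||||MSH|MH|D000001|Example disease|0|N||",
    "C0000001|ENG|S|L2|PF|S2|Y|A2||||SNOMEDCT|PT|100001|Example disease (disorder)|9|N||",
    "C0000002|ENG|P|L3|PF|S3|Y|A3||||OMIM|PT|100002|Another example disease|0|N||",
]

# Hypothetical DO terms with the CUIs recorded for them.
DO_CUIS = {"DOID:0000001": ["C0000001"], "DOID:0000002": ["C0000009"]}


def index_by_cui(rows):
    """Group (vocabulary, code) pairs by the CUI they share."""
    by_cui = defaultdict(set)
    for row in rows:
        fields = row.split("|")
        by_cui[fields[CUI]].add((fields[SAB], fields[CODE]))
    return by_cui


def map_do_terms(do_cuis, by_cui):
    """For each DO term, list the external vocabulary codes that share one of its CUIs."""
    mapping = {}
    for doid, cuis in do_cuis.items():
        codes = set()
        for cui in cuis:
            codes |= by_cui.get(cui, set())
        mapping[doid] = sorted(codes)
    return mapping


print(map_do_terms(DO_CUIS, index_by_cui(SAMPLE_ROWS)))
```

A DO term whose CUI has no match in the file (the second placeholder term here) simply maps to an empty list, and a single term can match several codes in one vocabulary, as noted in the Table 1 legend.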

We are investigating a technical solution to enable the 
connection of external references through a synthetic term 
derived from the logical definitions. DO's set of OMIM
cross-references has been validated through manual
review (Spring 2011) of each disease term. OMIM cross- 
references have been added to DO through this process to 
raise the count of mapped DO-OMIM records to 1630. 

DO WEB INTERFACE 

DO browser database 

The DO web-browser was constructed using Web 2.0 and 
semantic web technologies. At the forefront of these is the 
Neo4j graph database server (http://neo4j.org/). Neo4j 
provides several robust and fast mechanisms to retrieve 
individual nodes or to traverse a set of nodes moving 
between each via their relationships that otherwise would 
require complex join operations in a relational database. 

Built into Neo4j are optimized functions for retrieving
the (all, shortest, user-defined) paths between two terms,
an operation that is very common and useful in visualizing term
relationships in an ontology. The DO browser leverages the RESTful
API of Neo4j (http://components.neo4j.org/neo4j-server/snapshot/rest.html)
to retrieve nodes and their associated
properties via HTTP AJAX calls. This enables power users
familiar with the Neo4j RESTful API to request data from
the Disease Ontology database 'programmatically' and facilitates
integration into external projects. Currently any
user may retrieve metadata for a specific DO term by
making use of our REST metadata API by constructing
an HTTP request in the following format:
http://www.disease-ontology.org/api/metadata/<DOID>

An example scenario would be retrieving the metadata
for the term 'transient cerebral ischemia':
http://www.disease-ontology.org/api/metadata/DOID:224

This query would return a JSON packet containing all 
the metadata for this term including parents, children, 
definition, xrefs, synonym, name, alternate IDs and DO 
identifier. In the future we hope to increase the number of 
API commands available to cover operations such as 
searching and the path between two nodes. 
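
A client for the metadata request described above might look like the following sketch. Only the URL pattern and the DOID:224 example come from the text; the JSON key names in the sample packet ("id", "name", "xrefs" and so on) are assumptions, since the exact response schema is not given here, and the helper names are hypothetical. The sample packet stands in for a live response so that the example runs offline.

```python
import json
from urllib.parse import quote

BASE_URL = "http://www.disease-ontology.org/api/metadata/"


def metadata_url(doid):
    """Build the metadata request URL for a DOID such as 'DOID:224'."""
    # Keep the colon unescaped so the URL matches the format shown in the text.
    return BASE_URL + quote(doid, safe=":")


# Hypothetical response body; the real key names may differ.
SAMPLE_PACKET = json.dumps({
    "id": "DOID:224",
    "name": "transient cerebral ischemia",
    "parents": [],
    "children": [],
    "xrefs": [],
    "synonyms": [],
})


def summarize(packet):
    """Return (identifier, name, number of xrefs) from a JSON metadata packet."""
    data = json.loads(packet)
    return data["id"], data["name"], len(data.get("xrefs", []))


print(metadata_url("DOID:224"))
print(summarize(SAMPLE_PACKET))
```

In a live setting the packet would come from an HTTP GET on the URL returned by `metadata_url`, for example with `urllib.request.urlopen`, subject to the service being reachable.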

DO terms are modeled in the Neo4j database with each 
node of the graph being a unique term containing the fol- 
lowing properties: Name, DOID, Definition, Synonym(s), 
Alternate ID(s), Subset(s), Cross-Reference(s) and 
Relationship. The edges of the graph database represent
relationships between terms in the ontology and have a
relationship type, positioning the DO browser to enable 
the exploration of term connections by relationships other 
than 'is_a' as the number of logical definitions in DO 
expands. 
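
The graph model described above, with terms as nodes and typed relationships as edges, can be illustrated with a small in-memory sketch. This is not Neo4j code and does not use its API: it is a plain breadth-first search, and the edge list, the function name `shortest_path` and the `rel_types` filter are assumptions for illustration. The term names and relations are taken from examples elsewhere in the text.

```python
from collections import deque

# (source term, relationship type, target term)
EDGES = [
    ("brain glioblastoma multiforme", "is_a", "brain glioma"),
    ("brain glioma", "is_a", "brain cancer"),
    ("gallbladder adenocarcinoma", "is_a", "gallbladder carcinoma"),
    ("gallbladder adenocarcinoma", "located_in", "gallbladder"),
]


def shortest_path(start, goal, edges=EDGES, rel_types=None):
    """Breadth-first search for a shortest path, ignoring edge direction.

    If rel_types is given, only edges of those relationship types are followed.
    Returns the list of terms on the path, or None if no path exists.
    """
    adjacency = {}
    for source, rel, target in edges:
        if rel_types is None or rel in rel_types:
            adjacency.setdefault(source, []).append(target)
            adjacency.setdefault(target, []).append(source)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path("brain glioblastoma multiforme", "brain cancer"))
print(shortest_path("gallbladder carcinoma", "gallbladder"))
print(shortest_path("gallbladder carcinoma", "gallbladder", rel_types={"is_a"}))
```

Restricting the search to is_a edges removes the route through located_in, which shows how typed edges let a browser explore connections by one relationship type at a time.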

Visualization 

User interface. The DO browser was designed with a
focus on presenting the ontology tree, query results,
DO term metadata and visualization on a single page
with multiple tabs, which allows any metadata, search
results or visualizations to persist while the ontology is
further explored.

The interface consists of HTML and CSS for 
controlling the layout, sizing, fonts and color scheme. 
The tree and visualization components are delivered
through AJAX using the ExtJS (http://www.sencha.com/products/extjs/)
and jQuery (http://jquery.com/) libraries
for added GUI elements and functionality. The full text
of the DO (name, synonym, definition, xrefs, DOIDs) is
included in the Apache Lucene
(http://lucene.apache.org/java/docs/index.html) index. The Lucene indexing allows
users to search all or any fields, as all text of the ontology
is run through an analyzer that removes common
English stop words and tokenizes the text, allowing for
flexible queries that return result sets containing partial
hits.
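
The indexing steps described above (tokenize, drop stop words, index each field, then combine query terms) can be sketched with a toy inverted index. This is not Lucene and does not reproduce its analyzers or scoring; the stop-word list is a small illustrative subset, the "EXAMPLE:" identifiers are placeholders, and the function names are hypothetical. The adenocarcinoma definition paraphrases the one given earlier in the text.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "that", "the", "to", "which"}

# Toy records; only DOID:224 and its name are taken from the text.
TERMS = {
    "DOID:224": {"name": "transient cerebral ischemia", "definition": ""},
    "EXAMPLE:1": {
        "name": "adenocarcinoma",
        "definition": "A carcinoma that is derived from epithelial cells "
                      "which originate in glandular tissue.",
    },
    "EXAMPLE:2": {"name": "gallbladder adenocarcinoma",
                  "definition": "A gallbladder carcinoma."},
}


def tokenize(text):
    """Lowercase, split on non-alphanumerics and remove stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(terms):
    """Map (field, token) and ('any', token) to the set of matching term IDs."""
    index = defaultdict(set)
    for term_id, fields in terms.items():
        for field, text in fields.items():
            for token in tokenize(text):
                index[(field, token)].add(term_id)
                index[("any", token)].add(term_id)
    return index


def search(index, query, field="any", match_all=True):
    """'Match All' intersects the per-token hits (AND); 'Match Any' unions them (OR)."""
    hits = [index.get((field, token), set()) for token in tokenize(query)]
    if not hits:
        return set()
    return set.intersection(*hits) if match_all else set.union(*hits)


INDEX = build_index(TERMS)
print(search(INDEX, "gallbladder carcinoma"))
print(search(INDEX, "gallbladder carcinoma", match_all=False))
print(search(INDEX, "carcinoma", field="name"))
```

Restricting the last query to the name field returns nothing because "carcinoma" appears only in definitions here, which illustrates why searching every field can surface terms that a name-only search would miss.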

The layout of the DO Browser can be sub-divided into 
three distinct sections, as seen in the 'DO Tutorial'
(http://disease-ontology.org/tutorial/). The 'Search Panel'
provides all the necessary tools to execute basic or advanced
queries on the DO. The 'Navigation Panel' contains
an interactive tree-based model enabling navigation and
exploration of subtrees. The ontology can be traversed 
by a single-click of the arrow found to the left of each 
term or a double-click of a term that is denoted with a 
folder icon. Once expanded any children for a given term 
will be rendered into the tree and can be likewise expanded 
until a leaf node is encountered. The Navigation Panel tree 
is refreshed when a term is selected from search results or a 
Metadata Panel. The 'Content Panel' tabs house search 
results, term metadata and graphic visualization of terms 
and term relationships. Visualization of nodes (Figure 2) 
in the DO Browser can be accessed through the 
'Visualization' button found in the 'Metadata Panel'. 
Invoking the visualization feature of a term will create a
new tab that will house an interactive canvas upon which
the target term and any children or parents will be rendered
and can be explored. By default, terms rendered on the
canvas in a visualization panel attempt to arrange themselves
in a layout that prevents any overlapping. Terms
that have any associated parents or children will be 
colored green and can be expanded by a single click. 
Upon selecting a term from either the Navigation Panel
tree or a set of search results, a new 'Metadata Panel' tab is
created to display available term metadata including
DOID, Name, Definition, Xrefs, Alternate IDs,
Synonyms, Relationships and a link to the DO term tracker.
Where available, cross-references (xrefs) and definitions
will contain links out to the relevant resource.
Figure 2. Term visualization in the DO web-browser. The 'Visualize' button on the Metadata page opens a graphical view of DO. Clicking on this
button will open a new tab that will display the target node of the visualization (e.g. basidiobolomycosis) [red box], parent node [green box] and 
sibling leaf nodes [gray box]. Nodes with five or more children are represented by a gold circle containing the number of children. Clicking on a node 
in the graph will expand the view. 



The 'Search Panel' provides 'full-text' contextual 
searching functionality against all metadata fields (Basic 
Search) and an Advanced Search that allows for the user 
to generate targeted and complex Boolean queries 
against specific fields of the ontology. The Advanced 
Search dialog box facilitates queries with the option to 
'Match All' (AND) or 'Match Any' (OR) of the query
terms provided. Each search generates a distinct panel 
allowing for persistence of result sets. 

Comparison with alternative ontology visualization 
services. The DO web browser was developed with 
expanded search and display capabilities for the DO. 
DO's non-Flash interactive graphic visualization and full-text
metadata searching are not available from current
ontology resources [e.g. EBI's Ontology Lookup Service
(24) and NCBO's BioPortal (25)] and provide accessibility
with mobile devices such as the iPad. Implementation
of full-text searching provides users with full access to
the depth and richness available in the Disease
Ontology. Alternative services limit their searches
to the name or synonym fields. Furthermore, the DO
Browser provides the ability to create complex searches.
The DO web browser uniquely provides links to definition
sources and Xref links to NCI, OMIM, ICD,
MeSH and SNOMED CT vocabulary terms. These 
features provide DO users with a true semantic linkage 
between disease concepts based on concept identifiers 
rather than text-based matching. Cross-browser and
cross-platform support was a strong design goal for
the Disease Ontology Browser, which is why the site
does not make use of any third-party plugins to render
content.



DO SOFTWARE DEVELOPMENT PROJECTS 

Concurrent with the Disease Ontology development, the 
DO group has developed the FunDO and GeneAnswers 
data access and exploration tools. The Functional Disease 
Ontology (FunDO) Web application
(http://django.nubic.northwestern.edu/fundo/) (3) can be used to measure the
internal consistency of DO as well as the ability of DO to 
functionally annotate a gene list with disease. FunDO 
takes a list of genes and finds relevant diseases based on 
statistical analysis of the Disease Ontology annotation 
database. 'GeneAnswers'
(http://www.bioconductor.org/packages/2.5/bioc/html/GeneAnswers.html) is a reusable
Bioconductor software package encompassing reproducible
disease-gene pathway models that can be utilized
directly by researchers or incorporated into other biomed- 
ical resources (26). GeneAnswers has been downloaded 
2000 times by >1000 members of the bioinformatics com- 
munity since July 2010. 'DOGA' (Human Disease Gene 
Annotation Database) is a beta version tool for examining 
disease-gene annotations through data available in Gene 
Wiki or in the NCBI GeneRIFs
(http://doga.nubic.northwestern.edu).



FUTURE DEVELOPMENT 

The Disease Ontology's representation of human disease 
is being advanced through the inclusion of cross- 
references to orthogonal concepts defined by logical def- 
initions in the HumanDO_xp.obo file. Connecting related 
ontological concepts, augmenting DO's relationship types 
and visualizing integrated disease mapping between 
biomedical resources will broaden the utility of the
Disease Ontology and DO web-browser. 

Extension of the DO paradigm 

The DO paradigm is extensible for the creation of 
non-human organism disease ontologies. Defining disease
by etiology and the affected body system is a universally
applicable principle for describing organism-specific
pathologies in model organisms, livestock or plants.
In defining diseases in DO, we have developed guiding principles
to facilitate consistent classification of disease terms.
For instance, a disease is classified first by etiology, if 
known, with a singular is_a relation. The location of the 
affected body system is annotated in the disease definition 
and then linked by the relation 'located_in' to the corresponding
FMA term. A disease of unknown etiology with
well defined localization is defined by the affected body
system. The DO Style Guide
(http://do-wiki.nubic.northwestern.edu/index.php/Style_Guide) outlines the
curatorial guiding principles for DO. For model organisms,
livestock or plants, the same approach supports
categorizations of infectious (viral, bacterial, fungal, parasitic),
inherited and acquired disease (cancer, metabolism
and mental health disease). Cross-references to models of
human disease in DO, as defined by the model organism 
databases, would define the disease to model relationship. 



AVAILABILITY 

Disease Ontology files are available under the Creative
Commons license in three formats: the OBO formatted
Disease Ontology (HumanDO.obo); the Disease
Ontology file without cross-references (HumanDO_no_xrefs.obo);
and an enhanced Disease Ontology file containing
logical definitions to orthogonal OBO Foundry
ontologies (HumanDO_xp.obo). The HumanDO.obo file
is available from SourceForge
(http://diseaseontology.svn.sourceforge.net/viewvc/diseaseontology/trunk/HumanDO.obo)
and can be downloaded from the OBO Foundry
(http://www.obofoundry.org/cgi-bin/detail.cgi?id=disease_ontology).
DO is available in OWL format at the
University of California at Berkeley
(http://www.berkeleybop.org/ontologies/owl/DOID). The Disease
Ontology web app source code will be made freely available
at: https://github.com/IGS/disease-ontology. The
Disease Ontology can also be browsed in EBI's Ontology
Lookup Service (24) and NCBO's BioPortal (25).